On an Improvement over Renyi's
Equivocation Bound
Abstract: We consider the problem of estimating the
probability of error in multi-hypothesis testing when the MAP
criterion is used. This probability, which is also known as the
Bayes risk, is an important measure in many communication and
information theory problems. In general, the exact Bayes risk
can be difficult to obtain. Many upper and lower bounds are
known in the literature. One such upper bound is the
equivocation bound due to Renyi, which is of great philosophical
interest because it connects the Bayes risk to conditional
entropy. Here we give a simple derivation for an improved
equivocation bound.

We then give some typical examples of problems where 
these bounds can be of use. We first consider a binary 
hypothesis testing problem for which the exact Bayes risk is 
difficult to derive. In such problems bounds are of interest. 
Furthermore using the bounds on Bayes risk derived in the 
paper and a random coding argument, we prove a lower bound 
on equivocation valid for most random codes over memoryless 
channels. 

I. Introduction 
In his celebrated paper of 1948, Shannon proved the
Channel Coding Theorem. This theorem essentially states
that the ensemble of long random block codes (and thus
some specific code), in the limit of very large block lengths,
achieves an arbitrarily low probability of error under
decoding by the jointly typical decision rule, when used over
a given channel at information rates below a limit called
the channel's Shannon capacity. It is well known that for
minimizing the Bayes risk, the optimal decision rule is
the Maximum A Posteriori Probability (MAP) decision rule.
Shannon uses the jointly typical decision rule in his analysis
because the rule is asymptotically optimal and it
simplifies the analysis considerably. The strong converse
to the channel coding theorem based on Fano's inequality
states that the probability of error under any decision rule
approaches 1 exponentially as the block length increases when
the rate is above capacity.

The Shannon capacity of a discrete memoryless channel
(DMC) is given by

$$C = \max_{p(x)} I(X;Y)$$

where I(X;Y) is the mutual information between the channel
input X and channel output Y. The mutual information is
given in terms of the entropy function as

$$I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} \qquad (1)$$



The source entropy H(X) is a function of the source 
statistics. The function H(X|Y) is called the conditional 
entropy or equivocation. Equivocation is dependent on the 
channel statistics as well as the properties of the channel 
code employed. For most non-trivial channels, computation 
of capacity is infeasible due to the optimization required 
over the input probability distribution of a highly nonlinear 
function. Good upper and lower bounds to capacity which
are easy to compute are therefore of interest. A useful lower
bound on capacity is clearly the mutual information for
some arbitrary p(x). Upper bounds usually require other
formulations.
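
As a small numerical illustration of the remark above (not part
of the original text), the following Python sketch evaluates (1)
for a finite channel matrix. The function name
`mutual_information`, the matrix layout `channel[k][j] = p(y=j | x=k)`
and the BSC numbers are assumptions made only for this example.
Any input distribution p(x) gives a lower bound on C in this way.

```python
from math import log2

def mutual_information(p_x, channel):
    """I(X;Y) in bits, eq. (1), for input pmf p_x and channel[k][j] = p(y=j | x=k)."""
    n_out = len(channel[0])
    # Output marginal p(y) = sum_x p(x) p(y|x).
    p_y = [sum(p_x[k] * channel[k][j] for k in range(len(p_x)))
           for j in range(n_out)]
    total = 0.0
    for k, px in enumerate(p_x):
        for j in range(n_out):
            joint = px * channel[k][j]
            if joint > 0.0:  # convention 0 log 0 = 0
                total += joint * log2(joint / (px * p_y[j]))
    return total

# Hypothetical example: BSC with crossover probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
print("uniform input :", mutual_information([0.5, 0.5], bsc))
print("skewed input  :", mutual_information([0.8, 0.2], bsc))
```

For the BSC the uniform input attains capacity, so the skewed
input yields a strictly smaller, but still valid, lower bound.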

The decoding problem for codes is an instance of the 
more general problem of multiple hypothesis testing which 
appears in some form in most fields of science. It is intuitive 
to say that the probability of error under the Bayes decision 
rule is a function of equivocation. That this is true was 
proved rigorously by Renyi in [R66]. Among other things,
he showed that $P_e \le H(X|Y)$. Hellman and Raviv later
improved on this result in [HR70] and showed that in fact
$P_e \le \frac{1}{2} H(X|Y)$. It is immediately clear that even this
improved bound is extremely loose when the equivocation
is over unity.

In this paper we first look at several tight classical
bounds on the Bayes risk in the general multi-hypothesis
testing problem. While these bounds were available in the
literature, they have not found widespread application in
communication theory. We give a simple binary hypothesis
testing problem where such bounds will be very helpful
in analyzing the optimal decision rule. We then derive a
new upper bound on the probability of error in multi-hypothesis
testing of the form

$$P_e \le 1 - 2^{-H(X|Y)} \qquad (2)$$

which, like the equivocation bounds [R66, HR70], relates $P_e$ to
the conditional entropy. But unlike the classical equivocation
bounds, the new bound never exceeds 1 and so never
becomes so loose as to be uninformative.

Next we use these bounds and a random coding argument
[Gal65, SGB67] to obtain, in a subsequent section, a sphere
packing lower bound on the probability of error under MAP for
the ensemble of random codes over any channel. Then we
specialize it to the case of a memoryless channel to obtain
a lower bound on equivocation for most random codes,

$$H(X|Y) \ge \frac{N}{2}(R - \rho) \qquad (3)$$

where N is the block length of a rate R code and ρ is
a function of the apriori input probability distribution and
the channel likelihood function. For a discrete memoryless
channel,

$$\rho_{p_x(\cdot)} = 2 \log_2 \sum_{j \in \mathcal{J}} \sqrt{\sum_{k \in \mathcal{K}} p_x(k)\, p_{y|x}(j|k)^2}$$

where the DMC transition function is given by $p_{y|x}(\cdot)$,
$p_x(\cdot)$ is some probability distribution on the input
alphabet, and $\mathcal{K}$ and $\mathcal{J}$ are the input and output alphabets
respectively. This also leads us to an upper bound on the
mutual information and hence the capacity of such channels.
For a discrete memoryless channel, $C \le \max_{p_x(\cdot)} \{\rho_{p_x(\cdot)}\}$.

In the next section, we derive some tight bounds on the 
probability of error under MAP. Some of these bounds are 
well known [Vaj68, Tou72, Dev74]. 

II. Bounds on Error Probability under MAP 
Criterion 

Consider an M-ary hypothesis testing problem. Let our
M hypotheses be denoted as $\{h_i : i \in \{1,2,\ldots,M\}\}$ and
their corresponding apriori probabilities be given by
$\{\pi_i : i \in \{1,2,\ldots,M\}\}$. Also let the noisy observation be y. For
MAP decision decoding, the conditional probability of error
is

$$P_{e|y} = 1 - \max_{i \in \{1,2,\ldots,M\}} P(h_i|y), \quad \text{while} \quad \sum_i P(h_i|y) = 1.$$



A. Bounds on Probability of error for binary hypothesis 
testing 

We begin by looking at the binary hypothesis problem.
If we use the MAP criterion, the average probability of error
is given by [HR70],

$$P_e = E_y\left[1 - \max_{i \in \{1,2\}} P(h_i|y)\right] = E_y\left[\min_{i \in \{1,2\}} P(h_i|y)\right]$$

for the two hypothesis case. By an application of the well
known weighted geometric mean inequality, we immediately
obtain the upper bound:

$$E_y\left[\min_{i \in \{1,2\}} P(h_i|y)\right] \le \min_{\alpha \in [0,1]} E_y\left[P(h_1|y)^{\alpha}\, P(h_2|y)^{(1-\alpha)}\right]$$

which is the popular Chernoff bound [Che52]. For the special
case of α = 1/2, this reduces to the Bhattacharyya bound
[Kai67]. The Chernoff bound is not particularly convenient to
use due to the required optimization outside the expectation,
while the Bhattacharyya bound is very loose.

Using the negative power mean inequalities, we can do
much better. We have for any β < 0,

$$P_e = E_y\left[\min_{i \in \{1,2\}} P(h_i|y)\right] \le 2^{-1/\beta}\, E_y\left[\left(P(h_1|y)^{\beta} + P(h_2|y)^{\beta}\right)^{1/\beta}\right]$$

While the bound gets tighter as β → −∞, for most
practical purposes we can limit ourselves to the case β = −1,
which corresponds to the harmonic mean. After simplifications,
we have

$$\mathrm{HM}(P(h_1|y), P(h_2|y)) = 2 P(h_1|y) P(h_2|y) = 1 - P(h_1|y)^2 - P(h_2|y)^2$$

where HM denotes the harmonic mean. So we have the
following pair of upper and lower bounds on the conditional
probability of error $P_{e|y}$:

$$P(h_1|y) P(h_2|y) \le P_{e|y} \le 2 P(h_1|y) P(h_2|y)$$



and for $P_e$:

$$P_{LB} \le P_e \le 2 P_{LB} \qquad (4)$$

where

$$P_{LB} \stackrel{\mathrm{def}}{=} \int_y \frac{\Pr(h_1)\Pr(h_2)\,P(y|h_1)\,P(y|h_2)}{\Pr(h_1)P(y|h_1) + \Pr(h_2)P(y|h_2)}\, dy.$$

It should be noted that we also obtained a convenient lower
bound on $P_e$, which is one half the upper bound, by making use
of the properties of harmonic means. We will refer to this
pair as the harmonic bound. The factor of 2 guarantee in
tightness between the upper and lower bounds on the probability
of error is usually enough for most practical applications.
Given in Appendix I is an example of a binary hypothesis
testing problem where the exact performance of the optimal
decision rule is difficult to determine and bounds are useful.
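
The harmonic bound (4) is easy to evaluate numerically. The
following Python sketch is an illustration only: it assumes a
finite observation alphabet (so the integral in $P_{LB}$ becomes
a sum), and the function name `binary_bounds` and the toy
likelihoods are hypothetical. It also reports the Bhattacharyya
bound for comparison.

```python
from math import sqrt

def binary_bounds(prior1, lik1, lik2):
    """Exact P_e, the harmonic bound pair (4) and the Bhattacharyya bound.

    lik1[y] = P(y | h_1), lik2[y] = P(y | h_2) over a finite set of y.
    """
    prior2 = 1.0 - prior1
    p_e = p_lb = bhatt = 0.0
    for l1, l2 in zip(lik1, lik2):
        j1, j2 = prior1 * l1, prior2 * l2   # joint probabilities P(h_i, y)
        p_y = j1 + j2
        if p_y == 0.0:
            continue                        # y never occurs
        p_e += min(j1, j2)                  # E_y[min_i P(h_i|y)]
        p_lb += j1 * j2 / p_y               # E_y[P(h_1|y) P(h_2|y)]
        bhatt += sqrt(j1 * j2)              # E_y[sqrt(P(h_1|y) P(h_2|y))]
    return {"P_LB": p_lb, "P_e": p_e, "2P_LB": 2.0 * p_lb,
            "Bhattacharyya": bhatt}

# Hypothetical three-letter observation alphabet, equal priors.
result = binary_bounds(0.5, [0.6, 0.3, 0.1], [0.1, 0.3, 0.6])
for name, value in result.items():
    print(f"{name:14s} {value:.4f}")
```

Since the harmonic mean never exceeds the geometric mean, the
upper harmonic bound is never looser than the Bhattacharyya
bound in this setting.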

One may ask if there are M-ary extensions to the 
harmonic bound. It turns out that this is indeed the case. 
Though motivated due to other reasons, such bounds are 
well known in the literature [Vaj68, Tou72, Dev74], with 
suggested applications in multi-hypothesis pattern recogni- 
tion. We look at some of these extensions in the next two 
sections. We will also derive a new inequality and upper 
bound during the process. 



B. Some Inequalities for bounded positive sequences 

In this section we first consider a few well known
inequalities for bounded positive valued sequences. We then
derive a new (to the authors) inequality. In the rest of the
section, $\{a_i : i \in \{1,2,\ldots,M\}\}$ is assumed to be a discrete
probability distribution. M is either finite or countably
infinite.

We will need some well known inequalities [BB61,
Vaj68, Tou72, Dev74] for proving our main results. For the
sake of completeness, we give a proof in the appendix.

Lemma II.1.

(i) $\max_i \{a_i\} \le \sqrt{\sum_i a_i^2}$

(ii) $\max_i \{a_i\} \ge \sum_i a_i^2$

(iii) $2\left(1 - \sqrt{\sum_i a_i^2}\right) \ge \left(1 - \sum_i a_i^2\right)$

Proof. Please see Appendix II. ■
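
A quick numerical sanity check of Lemma II.1 can be sketched as
follows. This is only an illustration on randomly drawn
distributions, not a proof; the helper names `lemma_ii1_holds`
and `random_pmf`, the seed and the tolerance are assumptions of
the example.

```python
import random
from math import sqrt

def lemma_ii1_holds(a, tol=1e-12):
    """Check parts (i)-(iii) of Lemma II.1 for a probability vector a."""
    s2 = sum(x * x for x in a)      # sum_i a_i^2
    m = max(a)
    return (m <= sqrt(s2) + tol                          # (i)
            and m >= s2 - tol                            # (ii)
            and 2.0 * (1.0 - sqrt(s2)) >= (1.0 - s2) - tol)  # (iii)

def random_pmf(size, rng):
    """Draw positive weights and normalize them to sum to one."""
    w = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
ok = all(lemma_ii1_holds(random_pmf(rng.randint(1, 8), rng))
         for _ in range(1000))
print("Lemma II.1 held on all random trials:", ok)
```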

The following inequality is new to the authors.
Motivated by continuity considerations, the conventions
$0 \log_2(0) = 0$ and $0^0 = 1$ are adopted.

Lemma II.2. $\sum_i a_i^2 \ge 2^{-H(a)} = \prod_i a_i^{a_i}$.

Proof. We use induction.

(1) M = 1 : $a_1 = 1$ is the only possibility and the claim holds.

(2) M = m + 1 : We prove the M = (m + 1) case assuming
that the claim is true for M = m. Consider the
normalized sequence $a'_i = a_i / \sum_{j=1}^{m} a_j = a_i / (1 - a_{m+1})$.
One may take $a_{m+1} \ne 1$, for otherwise the claim is
trivially true. By the induction hypothesis,

$$\sum_{i=1}^{m} (a'_i)^2 \ge \prod_{i=1}^{m} (a'_i)^{a'_i}$$

After some algebra, we get

$$\sum_{i=1}^{m+1} a_i^2 \ge a_{m+1}^2 + (1 - a_{m+1}) \left(\prod_{i=1}^{m} a_i^{a_i}\right)^{1/(1 - a_{m+1})}$$

We are done if we show that $x^2 + (1-x)\,y^{1/(1-x)} \ge x^x y$
when $0 \le x, y \le 1$, where $x = a_{m+1}$ and
$y = \prod_{i=1}^{m} a_i^{a_i}$. To see that this is true, let
us fix $0 \le x = \alpha < 1$ and consider the function
$f(y) = \alpha^2 + (1-\alpha)\,y^{1/(1-\alpha)} - \alpha^{\alpha} y$.
Taking derivatives, $f'(y) = y^{\frac{1}{1-\alpha} - 1} - \alpha^{\alpha}$
and $f''(y) = \frac{\alpha}{1-\alpha}\, y^{\frac{1}{1-\alpha} - 2} \ge 0$
because $0 \le \alpha < 1$. So f(y) is a convex ∪ function of y
and has a global minimum of 0 at $y = \alpha^{1-\alpha}$.
This completes the proof.

C. Tight Bounds on probability of error in multi-hypothesis
testing

One can substitute $P(h_i|y)$ for $a_i$ in the inequalities
derived in the previous section. Then we have the following:

$$1 - \sqrt{\sum_i P(h_i|y)^2} \le P_{e|y} \le 2 - 2\sqrt{\sum_i P(h_i|y)^2} \qquad (5)$$

A related pair of bounds

$$\frac{1}{2}\left(1 - \sum_i P(h_i|y)^2\right) \le P_{e|y} \le 1 - \sum_i P(h_i|y)^2 \qquad (6)$$

was first discussed in [Vaj68] in the context of Vajda's
quadratic entropy and later by Toussaint [Tou72], who
proposed the quadratic mutual information, and by Devijver
[Dev74], who popularized a closely connected measure
called the Bayesian Distance in pattern recognition. Devijver
also mentions the lower bound in (5). The latter pair (6) can
be thought of as an M-ary extension to the harmonic mean
bound.

D. An Improvement over Renyi's Equivocation Bound

Now we consider upper bounds relating $P_e$ with the
equivocation. In [R66], Renyi derived the bound:

$$P_{e|y} \le H(P(h|y)) \qquad (7)$$

Hellman and Raviv later improved this bound in [HR70] to:

$$P_{e|y} \le \frac{1}{2} H(P(h|y)) \qquad (8)$$

These relations are not bounded and can get very loose when
there are many hypotheses with roughly equal aposteriori
probabilities. Using the new inequality from Lemma II.2 we
get:

$$P_{e|y} \le 1 - \sum_{i=1}^{M} P(h_i|y)^2 \le 1 - 2^{-H(P(h|y))} \qquad (9)$$

where H(.) denotes the usual entropy function. Recalling
that $H(X|Y) \stackrel{\mathrm{def}}{=} E_y[H(P(h|y))]$,

$$P_e \le 1 - E_y\left[2^{-H(P(h|y))}\right] \le 1 - 2^{-H(X|Y)} \qquad (10)$$

where we used the fact that $2^{-z}$ is a convex ∪ function of z
and Jensen's inequality. This is a new bound which relates
equivocation to the Bayes risk. It is also an improvement over
the Renyi and Hellman-Raviv bounds. Expanding the bound
in (10) as a power series,

$$P_e \le 1 - 2^{-H(X|Y)} = \sum_{n=1}^{\infty} (-1)^{n+1} \frac{(H(X|Y) \ln 2)^n}{n!} \qquad (11)$$

which is always better than the Renyi bound and at most
a factor of ln 4 < 1.4 worse than the Hellman-Raviv
equivocation bound, which is quite acceptable for most
purposes. While for the binary hypothesis case the new
bound of (10) is not as tight as the equivocation bound of
(8), as the number of hypotheses increases the equivocation
can far exceed 1. This makes both the Renyi and Hellman-
Raviv bounds very loose. For example, when $P(h_i|y) = \frac{1}{M}$,
the Hellman-Raviv equivocation bound is not informative
at a loose $\log_2 \sqrt{M}$, while the new bound gives a tight
$1 - 2^{-\log_2 M} = \frac{M-1}{M}$.
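
The comparison above can be reproduced with a few lines of
Python. The sketch below is illustrative only; the function
names `entropy_bits` and `posterior_bounds` and the dictionary
labels are assumptions of the example. It evaluates the
conditional error and the bounds (7), (8) and (9) for a single
aposteriori distribution, here the uniform one.

```python
from math import log2

def entropy_bits(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(x * log2(x) for x in p if x > 0.0)

def posterior_bounds(p):
    """Conditional error 1 - max_i p_i and the bounds (7), (8), (9)."""
    h = entropy_bits(p)
    return {
        "P_e|y": 1.0 - max(p),
        "Renyi (7)": h,
        "Hellman-Raviv (8)": 0.5 * h,
        "quadratic (9)": 1.0 - sum(x * x for x in p),
        "new (9)": 1.0 - 2.0 ** (-h),
    }

# Uniform aposteriori probabilities P(h_i|y) = 1/M.
for m in (2, 4, 16, 256):
    b = posterior_bounds([1.0 / m] * m)
    print(m, {k: round(v, 4) for k, v in b.items()})
```

For the uniform case the new bound coincides with the exact
conditional error (M-1)/M, whereas (7) and (8) grow like
log_2 M and exceed 1 for large M.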

Comparing the various bounds, the Bayesian distance 
based bounds of (5) and (6) are far tighter than both 
the conditional entropy based bounds (10), (8) and the 
well known union bound using only pairwise error event 
probabilities. In Figure 1, we can see the various bounds 
discussed above for the binary hypothesis case. 

Figure 1: Probability of error $P_e$ and various bounds on it for binary
hypothesis testing.

There are many instances of M-ary hypothesis testing 
in communication theory where the bounds discussed in this 
section can be valuable fundamental analysis tools. The rest 
of the paper uses only the bounds given by (5) and (10). 
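
To make the use of (5) and (10) concrete, the following Python
sketch averages them over a finite observation alphabet. It is
an illustration under stated assumptions: the function name
`averaged_bounds`, the layout `likelihoods[i][y] = P(y | h_i)`
and the three-hypothesis numbers are hypothetical.

```python
from math import log2, sqrt

def averaged_bounds(priors, likelihoods):
    """Exact P_e with the averaged bounds (5) and the bound (10).

    likelihoods[i][y] = P(y | h_i) over a finite set of observations y.
    """
    n_y = len(likelihoods[0])
    p_e = e_sqrt = equivocation = 0.0
    for y in range(n_y):
        joint = [pi * lik[y] for pi, lik in zip(priors, likelihoods)]
        p_y = sum(joint)
        if p_y == 0.0:
            continue
        post = [j / p_y for j in joint]              # P(h_i | y)
        p_e += p_y * (1.0 - max(post))               # MAP error
        e_sqrt += p_y * sqrt(sum(q * q for q in post))
        equivocation += p_y * -sum(q * log2(q) for q in post if q > 0.0)
    return {
        "lower (5)": 1.0 - e_sqrt,
        "P_e": p_e,
        "upper (5)": 2.0 - 2.0 * e_sqrt,
        "upper (10)": 1.0 - 2.0 ** (-equivocation),
        "H(X|Y)": equivocation,
    }

# Hypothetical symmetric three-hypothesis problem.
lik = [[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
for name, value in averaged_bounds([1 / 3, 1 / 3, 1 / 3], lik).items():
    print(f"{name:11s} {value:.4f}")
```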

III. A Random Coding Sphere Packing Lower 
Bound on P e and Equivocation 

In this section, we apply the random coding
argument [Gal65, SGB67] to obtain a lower bound on
the ensemble average of the expected probability of error under
MAP decoding for any channel.

From the lower bound in (5) we have

$$P_e = E_y[P_{e|y}] \ge 1 - E_y\left[\sqrt{\sum_i P(h_i|y)^2}\right]$$

Now consider the ensemble of random codes. Each
codeword in a random code in this ensemble is chosen
independently and at random from the set of all possibilities
with a probability of P(x). We will use the overbar to denote
the ensemble average. The following is immediately obtained:

$$\overline{P_e} \ge 1 - \overline{E_y\left[\sqrt{\sum_i P(h_i|y)^2}\right]} \qquad (12)$$

where we made use of the linearity property of expectation.
There is also a corresponding upper bound:

$$\overline{P_e} \le 2 - 2\,\overline{E_y\left[\sqrt{\sum_i P(h_i|y)^2}\right]}$$

In this paper, we will not be further concerned with the above
upper bound on $\overline{P_e}$. Instead we concentrate on inequality
(12).

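The inequality in (12) can be further simplified when
expanded out in terms of the input and output probability
distributions and the channel likelihood function as follows:

$$\begin{aligned}
\overline{P_e} &\ge 1 - \overline{E_y\left[\sqrt{\sum_i P(h_i|y)^2}\right]} \\
&= 1 - \overline{\sum_y P(y) \sqrt{\sum_i P(h_i|y)^2}} \\
&= 1 - \overline{\sum_y P(y) \sqrt{\sum_i \left(\frac{P(h_i)\, P(y|h_i)}{P(y)}\right)^2}} \\
&= 1 - \sum_y \overline{\sqrt{\sum_i \left(P(h_i)\, P(y|h_i)\right)^2}}
\end{aligned} \qquad (13)$$

where we used linearity of expectation in the last step.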

Due to the tightness of the bound on $P_{e|y}$ which we
used initially, the ensemble average lower bound of (13)
is also tight within a factor of 2. However, the expression
is not easily amenable to further simplification. We now
apply Jensen's inequality to obtain a looser yet considerably
simpler lower bound:

$$\overline{P_e} \ge 1 - \sum_y \overline{\sqrt{\sum_i \left(P(h_i)\, P(y|h_i)\right)^2}} \ge 1 - \sum_y \sqrt{\sum_i \overline{\left(P(h_i)\, P(y|h_i)\right)^2}} \qquad (14)$$

Here we used the fact that $\sqrt{x}$ is a concave ∩ function of x.
Then by Jensen's inequality, $E_x[\sqrt{f(x)}] \le \sqrt{E_x[f(x)]}$.

Let us also assume without loss of generality that our
hypothesis (codeword) $h_i$ occurs with an apriori probability
$\pi_i$. In particular for the equiprobable case, $\pi_i = \frac{1}{M}$, where
M is the total number of codewords in the code under
consideration. We get

$$\overline{P_e} \ge 1 - \sum_y \sqrt{\sum_i \pi_i^2\, \overline{P(y|h_i)^2}} = 1 - \sum_y \sqrt{\sum_i \pi_i^2\, E_x\left[P(y|x)^2\right]} \qquad (15)$$

as the ensemble average is independent of the particular
hypothesis (transmitted codeword). In the above equation,
x is a random vector drawn from the ensemble according to
a probability distribution P(x).

Ideally, we would like to optimize on the codeword
apriori probabilities subject to certain constraints:

$$\text{Minimize } \sum_i \pi_i^2 \quad \text{subject to} \quad -\sum_i \pi_i \log_2 \pi_i = NR, \quad \sum_i \pi_i = 1 \quad \text{and} \quad \pi_i \ge 0,\ \forall i \qquad (16)$$

where N is the block length of the code and R is its
information rate in bits per channel use. If we set
$NR = \log_2 M$, the only feasible solution is $\pi_i = \frac{1}{M}$.
This choice of apriori probabilities is also justified by the
Channel Coding Theorem for the DMC, where an equally likely
selection of codewords is shown to achieve channel capacity
for an ensemble of random codes. With this setting, we get:

$$\overline{P_e} \ge 1 - \frac{1}{\sqrt{M}} \sum_y \sqrt{\sum_x P(x)\, P(y|x)^2} \qquad (17)$$

We now specialize (17) to the case of a discrete
memoryless channel. Recall that, for a discrete memoryless
channel which is discrete in time,

$$P(y|x) = \prod_n p_{y|x}(y_n|x_n)$$

By the proof of the Channel Coding Theorem [Sha48], we
know that for random ensembles of codes where codewords
are chosen such that each symbol is chosen independently
of the others using a probability distribution given by $p_x(\cdot)$,
the ensemble average probability of decoding error (under the
suboptimal jointly typical decoding) tends to zero as block
lengths tend to infinity. We will likewise specialize to
such an ensemble of codes, without any loss of generality.
For this special class of codes, $P(x) = \prod_n p_x(x_n)$. So,

$$\begin{aligned}
\overline{P_e} &\ge 1 - \frac{1}{\sqrt{M}} \sum_y \sqrt{\sum_x \prod_n p_x(x_n)\, p_{y|x}(y_n|x_n)^2} \\
&= 1 - \frac{1}{\sqrt{M}} \sum_y \sqrt{\prod_n \sum_{x_n \in \mathcal{K}} p_x(x_n)\, p_{y|x}(y_n|x_n)^2} \\
&= 1 - \frac{1}{\sqrt{M}} \prod_n \sum_{y_n \in \mathcal{J}} \sqrt{\sum_{x_n \in \mathcal{K}} p_x(x_n)\, p_{y|x}(y_n|x_n)^2} \\
&= 1 - \frac{1}{\sqrt{M}} \left(\sum_{j \in \mathcal{J}} \sqrt{\sum_{k \in \mathcal{K}} p_x(k)\, p_{y|x}(j|k)^2}\right)^{N}
\end{aligned} \qquad (18)$$

where $\mathcal{K}$ and $\mathcal{J}$ are the input and output alphabets
respectively. In performing the above simplifications, we made
repeated use of interchanging summation and product.

Let us define a parameter ρ as follows:

$$\rho = 2 \log_2 \sum_{j \in \mathcal{J}} \sqrt{\sum_{k \in \mathcal{K}} p_x(k)\, p_{y|x}(j|k)^2} \qquad (19)$$

A. Continuous Alphabet channels

It is usual to define [McE02] a continuous alphabet
channel to be memoryless when, for any finite quantization
of the input and output alphabets, the quantized discrete
channel is memoryless. Under this definition, and if we assume
that the associated probability measures are regular [Fel70],
the corresponding result holds for any memoryless channel,
where the summations are replaced by appropriate Riemann
integrations. So for well behaved continuous alphabet
memoryless channels,

$$\overline{P_e} \ge 1 - \frac{1}{\sqrt{M}} \left(\int_{\beta \in \mathcal{J}} \sqrt{\int_{\alpha \in \mathcal{K}} p(\alpha)\, p(\beta|\alpha)^2\, d\alpha}\; d\beta\right)^{N} \qquad (20)$$

Here we define ρ as follows:

$$\rho = 2 \log_2 \int_{\beta \in \mathcal{J}} \sqrt{\int_{\alpha \in \mathcal{K}} p_x(\alpha)\, p_{y|x}(\beta|\alpha)^2\, d\alpha}\; d\beta \qquad (21)$$

B. A Lower Bound on Equivocation 

Earlier we chose $M = 2^{NR}$. Thus for either a discrete
alphabet or a well behaved continuous alphabet memoryless
channel,

$$\overline{P_e} \ge 1 - 2^{-\frac{N}{2}(R - \rho)} \qquad (22)$$

Using Jensen's inequality and (10) we get:

$$\overline{P_e} \le 1 - \overline{2^{-H(X|Y)}} \le 1 - 2^{-\overline{H(X|Y)}} \qquad (23)$$

On combining (22) and (23) we have proved:

Theorem III.1 Most codes in the ensemble of capacity
achieving random codes considered in this section, when
used over a memoryless channel, satisfy the lower bound on
equivocation:

$$H(X|Y) \ge \frac{N}{2}(R - \rho) \qquad (24)$$
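
For a discrete memoryless channel the quantities in (19), (22)
and (24) are straightforward to evaluate. The Python sketch
below is an illustration, not the authors' code: the function
names, the matrix layout `channel[k][j] = p(y=j | x=k)`, the
BSC parameters and the clipping of trivial (negative) values to
zero are all assumptions of the example.

```python
from math import log2, sqrt

def rho(p_x, channel):
    """rho of (19): 2 log2 sum_j sqrt(sum_k p_x(k) p(j|k)^2)."""
    n_out = len(channel[0])
    s = sum(sqrt(sum(p_x[k] * channel[k][j] ** 2 for k in range(len(p_x))))
            for j in range(n_out))
    return 2.0 * log2(s)

def error_lower_bound(n, rate, rho_value):
    """Right-hand side of (22); clipped at 0 where the bound is trivial."""
    return max(0.0, 1.0 - 2.0 ** (-0.5 * n * (rate - rho_value)))

def equivocation_lower_bound(n, rate, rho_value):
    """Right-hand side of (24) in bits; clipped at 0 (assumed convention)."""
    return max(0.0, 0.5 * n * (rate - rho_value))

# Hypothetical example: BSC with crossover 0.1, uniform inputs, N = 100.
bsc = [[0.9, 0.1], [0.1, 0.9]]
r = rho([0.5, 0.5], bsc)
print("rho =", r)
for rate in (0.5, 0.8, 0.9):
    print(rate, error_lower_bound(100, rate, r),
          equivocation_lower_bound(100, rate, r))
```

Rates below ρ give only the trivial value 0, while rates above
ρ drive the error bound toward 1 as N grows.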



Another application of (22) is in upper bounding the 
capacity of memoryless channels. In Appendix III this is 
explored further. Several simple examples are also provided. 



Appendix I 
A Binary Hypothesis Testing Problem 

Example 1.1 Consider the following binary hypothesis
testing problem:

$h_1 : y = n_1$
$h_2 : y = n_2$

where $n_1$ is distributed with a pdf given by

P(y\h 1 )^f ni (z) = 2 (cos(t/2)) 2 e-\ t \ 



having unit variance and $n_2$ has a Gaussian pdf given by

P(y|/z 2 )=/„ 2 (z) = y|e-^ 2 
and the apriori probabilities are Pr(/i,) = 1. 







Figure 2: The aposteriori probabilities corresponding to the two 
hypothesis when 7 = g- The decision region boundaries 
are marked by the crossings of the two plots. 

From Figure 2, we can see that the optimum decision region
for this problem is very difficult to compute in general. As
a result the exact Bayes risk is also difficult to obtain, and
tight bounds on $P_e$ are of interest. There are no tightness
guarantees for either the Bhattacharyya or Chernoff bound,
while the harmonic bound of (4) is very tight, as can be
observed in Figure 3.

Figure 3: Bounds on Bayes risk in the two hypothesis testing
problem.



Appendix II 
Proof of Lemma II. 1 

(i) Our proof is by mathematical induction.

(1) M = 1 : $a_1 = 1$ is the only possibility, and the claim
is obvious.

(2) M = m + 1 : Let us hypothesize that the claim
is true for M = m. We now prove the M =
(m + 1) case. Let us use the notation
$\mu_M(a) = \max_{i=1}^{M} \{a_i\}$, so that

$$\mu_{m+1}(a) = \max\{\mu_m(a),\, a_{m+1}\}$$

Consider the normalized sequence
$a'_i = a_i / \sum_{j=1}^{m} a_j = a_i / (1 - a_{m+1})$.
We can safely take $a_{m+1} \ne 1$, for otherwise
the claim is trivially true. By the induction
hypothesis,

$$\mu_m(a') \le \sqrt{\sum_{i=1}^{m} (a'_i)^2}$$

This gives us $\mu_m(a) \le \sqrt{\sum_{i=1}^{m} a_i^2}$. So we are done
if we prove that

$$\max\left\{\sqrt{\sum_{i=1}^{m} a_i^2},\; a_{m+1}\right\} \le \sqrt{\sum_{i=1}^{m+1} a_i^2}$$

But we know that, for nonnegative x and y,
$x, y \le \max\{x, y\} \le \sqrt{x^2 + y^2}$
by considering each case separately.

(ii) Clearly, $\max_i \{a_i\} = \sum_i \max_j \{a_j\} \cdot a_i \ge \sum_i a_i^2$.

(iii) We need only observe that $2(1 - x) \ge 1 - x^2$. ■

Appendix III 
An Upper Bound on Mutual Information and 
Capacity 

By observing the bound of (22), we see that the bound
is trivial whenever $R \le \rho$. However, when $R > \rho$,
$\overline{P_e} \to 1$ exponentially. On the other hand, for the ensemble
of codes we considered, the Channel Coding Theorem
says that the ensemble probability of error can be made
arbitrarily small, using even the suboptimal jointly typical
decoding algorithm at the decoder, whenever the rate R is below
the mutual information between channel input and output.
Therefore we have proved the following upper bound on the
mutual information (and hence the capacity) of a memoryless
channel:

$$I(X;Y) \le \rho_{p_x(\cdot)} \qquad (25)$$

$$C = \max_{p_x(\cdot)} \{I(X;Y)\} \le \max_{p_x(\cdot)} \{\rho_{p_x(\cdot)}\} \qquad (26)$$

where ρ is given by either (19) or (21).
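
The bounds (25) and (26) can be checked numerically for a small
binary-input channel. The following Python sketch is a
hypothetical illustration: the function names, the grid search
over input distributions and the Z-channel parameters are
assumptions of the example, and the grid search only
approximates the two maximizations.

```python
from math import log2, sqrt

def rho(p_x, channel):
    """rho of (19) for channel[k][j] = p(y=j | x=k)."""
    n_out = len(channel[0])
    s = sum(sqrt(sum(p_x[k] * channel[k][j] ** 2 for k in range(len(p_x))))
            for j in range(n_out))
    return 2.0 * log2(s)

def mutual_information(p_x, channel):
    """I(X;Y) in bits from the definition (1)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[k] * channel[k][j] for k in range(len(p_x)))
           for j in range(n_out)]
    total = 0.0
    for k, px in enumerate(p_x):
        for j in range(n_out):
            joint = px * channel[k][j]
            if joint > 0.0:
                total += joint * log2(joint / (px * p_y[j]))
    return total

def scan_binary_inputs(channel, steps=1000):
    """Grid search over p_x = (q, 1-q); returns (max I, max rho)."""
    best_i = best_rho = 0.0
    for s in range(1, steps):
        p_x = [s / steps, 1.0 - s / steps]
        best_i = max(best_i, mutual_information(p_x, channel))
        best_rho = max(best_rho, rho(p_x, channel))
    return best_i, best_rho

# Hypothetical Z-channel: input 0 is noiseless, input 1 flips w.p. 0.3.
z_channel = [[1.0, 0.0], [0.3, 0.7]]
capacity_estimate, rho_bound = scan_binary_inputs(z_channel)
print("approx. capacity:", capacity_estimate)
print("approx. max rho :", rho_bound)
```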

A. Discussion and Some Examples 

The practical usefulness of the derived upper bound
depends on two factors, namely the tightness of the bound
and the ease of computation. In the derivation of the upper
bound for mutual information, the only loss in tightness is in
the use of Jensen's inequality during the ensemble averaging
process. The function ρ has to be maximized over all possible
input distributions $p_x(\cdot)$ to obtain the upper bound. The
required optimization can make the computation of the bound
difficult. However, the expression is considerably simpler
than the expression for mutual information and may be easier
to deal with for some particular channel.

Below, results are presented for some very common 
channels. The tightness of the upper bound on capacity is 
found to be acceptable. 




1) Binary Symmetric Channels: For a BSC with
crossover probability p, using the capacity achieving input
distribution we get $\rho = 1 + \log_2(p^2 + (1-p)^2)$, and the
capacity is well known to be $C = 1 - H_2(p)$, where $H_2(\cdot)$
is the binary entropy function. See Figure 4(a).

2) Binary Erasure Channels: For a BEC with probability
of erasure ε, again using the capacity achieving input
distribution we get $\rho = 2\log_2\left(\epsilon + \sqrt{2}\,(1-\epsilon)\right)$, whereas
the capacity is given by $C = 1 - \epsilon$. Both are shown in
Figure 4(b).

3) Binary Input - zero-mean AWGN - Soft Output
Channel: For a memoryless channel with binary input (±1)
and soft output, affected by additive white Gaussian noise
of zero mean, using the capacity achieving input distribution
of [1/2, 1/2] probability, we get

$$\rho = -\log_2(4\pi\sigma^2) + 2\log_2 \int \sqrt{e^{-\frac{(y-1)^2}{\sigma^2}} + e^{-\frac{(y+1)^2}{\sigma^2}}}\; dy$$

and the capacity is given by:

$$C = -\frac{1}{2}\log_2(2\pi e \sigma^2) - \int p(y) \log_2 p(y)\, dy, \qquad p(y) = \frac{1}{2\sqrt{2\pi\sigma^2}}\left(e^{-\frac{(y-1)^2}{2\sigma^2}} + e^{-\frac{(y+1)^2}{2\sigma^2}}\right)$$

where the noise is distributed as $N(0, \sigma^2)$ with variance
$\sigma^2$. In this case, the numerical integration required
for computing the capacity is unstable at very low
signal-to-noise ratios, due to the presence of the log(·)
function in the integrand. However, the upper bound integration
remains stable down to much lower signal-to-noise ratios.
See Figure 4(c).

Figure 4: Actual capacity and upper bounds on the capacity for some
common channels. (a) BSC with crossover probability of p.
(b) BEC with probability of erasure ε. (c) Binary input, AWGN,
soft output channel; the binary inputs are (±1).
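
The ρ integral for the binary input AWGN channel can be
approximated with a simple quadrature rule. The Python sketch
below is an illustration under stated assumptions: the function
name `rho_biawgn`, the trapezoid rule, the truncation of the real
line to ±(1 + 12σ) and the step count are choices made for the
example, not details taken from the paper.

```python
from math import exp, log2, pi, sqrt

def rho_biawgn(sigma, half_width=None, steps=20000):
    """rho for equiprobable inputs +/-1 and N(0, sigma^2) noise.

    Evaluates -log2(4 pi sigma^2) + 2 log2 of the integral of
    sqrt(exp(-(y-1)^2/sigma^2) + exp(-(y+1)^2/sigma^2)) by the
    trapezoid rule on [-half_width, half_width].
    """
    if half_width is None:
        half_width = 1.0 + 12.0 * sigma   # assumed truncation of the real line
    h = 2.0 * half_width / steps
    s2 = sigma * sigma
    total = 0.0
    for i in range(steps + 1):
        y = -half_width + i * h
        f = sqrt(exp(-(y - 1.0) ** 2 / s2) + exp(-(y + 1.0) ** 2 / s2))
        total += f if 0 < i < steps else 0.5 * f   # half weight at the ends
    return -log2(4.0 * pi * s2) + 2.0 * log2(total * h)

for sigma in (0.2, 0.5, 1.0, 2.0, 5.0):
    print(sigma, rho_biawgn(sigma))
```

The integrand contains no logarithm, which is consistent with
the remark above that the upper bound integration is the
better behaved of the two. For small σ the value approaches 1
bit, and for large σ it decays toward 0.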

The definition of capacity or mutual information was 
not needed in the derivation of the capacity bound in this 
section because of the use of the random coding argument. 
It was pointed out by Prof. Shlomo Shamai (Shitz) that it is 
possible to derive the above bound using only the functional 
definition of mutual information (1) and Jensen's inequality. 
For most practical applications, a tightness guarantee is also 
desirable. 

Acknowledgments: The authors wish to thank Prof. 
Shlomo Shamai (Shitz) for pointing out a simple alternate 
derivation of the capacity bound. 